Today we will…
Functions allow you to automate common tasks!
Writing functions has 3 big advantages over copy-paste:
Let’s define the function.
add_two <- : The name of the function is chosen by the author.
The argument(s) of the function are chosen by the author.
If we supply a default value when defining the function, the argument is optional when calling the function.
something defaults to 2.
{ }: The body of the function is where the action happens.
return(): Your function will give back what would normally print out…
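The pieces described above can be put together in one small definition. This is a minimal sketch: the body (adding the two arguments) is an assumption for illustration, and only the name add_something and the default of 2 for something come from the slides.

```r
# x is required; something is optional because it has a default value of 2.
add_something <- function(x, something = 2) {
  # return() hands the result back to the caller.
  return(x + something)
}

add_something(x = 5)                  # something falls back to 2, giving 7
add_something(x = 5, something = 10)  # the default is overridden, giving 15
```

Calling the function without something works only because the default was supplied in the definition; x has no default, so leaving it out would raise an error.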
When a function requires an input of a specific data type, check that the supplied argument is valid.
add_something <- function(x, something){
  stopifnot(is.numeric(x),
            is.numeric(something)
            )
  return(x + something)
}

add_something(x = "dog", something = 2)
Error in add_something(x = "dog", something = 2): is.numeric(x) is not TRUE

add_something(x = 2, something = "something")
Error in add_something(x = 2, something = "something"): is.numeric(something) is not TRUE
add_something <- function(x, something){
  if(!is.numeric(x)){
    stop("Please provide a numeric input for the x argument.")
  }
  return(x + something)
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): Please provide a numeric input for the x argument.
add_something <- function(x, something){
  # || is the scalar "or", which is the idiomatic choice inside if()
  if(!is.numeric(x) || !is.numeric(something)){
    stop("Please provide numeric inputs for both arguments.")
  }
  return(x + something)
}

add_something(x = 2, something = "R")
Error in add_something(x = 2, something = "R"): Please provide numeric inputs for both arguments.
The location (environment) in which we can find and access a variable is called its scope.
We cannot access variables created inside a function outside of the function.
Name masking occurs when an object in the function environment has the same name as an object in the global environment.
Functions look for objects FIRST in the function environment and SECOND in the global environment.
It is not good practice to rely on global environment objects inside a function!
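The scoping rules above can be seen in a short experiment. This is only an illustrative sketch; all object and function names here (y, masked, uses_global, make_local, inside_only) are hypothetical and do not come from the slides.

```r
y <- 100  # created in the global environment

# Name masking: the y created inside the function is found first.
masked <- function() {
  y <- 1
  y + 2
}

# There is no local y here, so R falls back to the y in the global
# environment. This runs, but the result silently depends on outside state.
uses_global <- function(x) {
  x + y
}

# inside_only exists only while the function is running.
make_local <- function() {
  inside_only <- 5
  inside_only
}

masked()               # 3
y                      # still 100: the function did not change the global y
uses_global(1)         # 101
make_local()           # 5
exists("inside_only")  # FALSE: not visible outside the function
```

Passing y as an argument instead of reading it from the global environment would make uses_global() safe to reuse in another script.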
You will make mistakes (create bugs) when coding.
print() debugging
Insert print() statements throughout your code to make sure the values are what you expect.
When you have a concept that you want to turn into a function…
Write a simple example of the code without the function framework.
Generalize the example by assigning variables.
Write the code into a function.
Call the function on the desired arguments.
This structure allows you to address issues as you go.
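The four steps above might look like this for a small task. The task itself (converting Fahrenheit to Celsius) and the names temp_f and f_to_c are hypothetical, chosen only to illustrate the workflow.

```r
# Step 1: a simple example of the code, without the function framework.
(68 - 32) * 5 / 9

# Step 2: generalize the example by assigning a variable.
temp_f <- 68
(temp_f - 32) * 5 / 9

# Step 3: write the code into a function.
f_to_c <- function(temp_f) {
  stopifnot(is.numeric(temp_f))
  return((temp_f - 32) * 5 / 9)
}

# Step 4: call the function on the desired arguments.
f_to_c(temp_f = c(32, 68, 212))  # 0 20 100
```

Because each step is run before moving on, a mistake in the formula shows up at step 1 or 2, before it is hidden inside a function.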
Write a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).
find_car_make("Toyota Camry") should return “Toyota”.
find_car_make("Ford Anglica") should return “Ford”.
You will write several small functions, then use them to unscramble a message. Many of the functions have been started for you, but none of them are complete as is.
Today we will…
We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).
find_car_make("Toyota Camry") returns “Toyota”.
find_car_make("Ford Anglica") returns “Ford”.
dplyr
Consider the mtcars data.
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Let’s use our new function:
mtcars |>
  rownames_to_column("make_model") |>
  mutate(make = find_car_make(make_model),
         .after = make_model) |>
  head(n = 3)

     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
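The body of find_car_make() is not shown in these slides. One possible sketch, assuming the make is simply the first word of the name, uses base R's sub(); the argument name car_name is an assumption.

```r
# Hypothetical implementation: keep everything before the first space.
find_car_make <- function(car_name) {
  stopifnot(is.character(car_name))
  # " .*$" matches the first space and everything after it.
  sub(pattern = " .*$", replacement = "", x = car_name)
}

find_car_make("Toyota Camry")                    # "Toyota"
find_car_make(c("Mazda RX4 Wag", "Datsun 710"))  # "Mazda" "Datsun"
```

Because sub() is vectorized, the function works on a whole column inside mutate(). A first-word rule would misread two-word makes, so a real analysis may need a lookup table instead.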
penguins data
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
We want to take in a vector of numbers and standardize it: rescale it so that all values fall between 0 and 1.
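The definition of std_to_01() does not appear in this text. A minimal sketch, assuming min-max rescaling and the argument name var that the later calls use, could be:

```r
# Assumed implementation: subtract the minimum, divide by the range.
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  lowest  <- min(var, na.rm = TRUE)
  highest <- max(var, na.rm = TRUE)
  # Missing values stay NA; all other values land in [0, 1].
  return((var - lowest) / (highest - lowest))
}

std_to_01(var = c(2, 4, 6, NA))  # 0.0 0.5 1.0 NA
```

The na.rm = TRUE arguments matter here: without them, a single NA in the input would turn every output value into NA.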
dplyr
Let’s standardize penguin measurements.
penguins |>
  mutate(bill_length_mm    = std_to_01(bill_length_mm),
         bill_depth_mm     = std_to_01(bill_depth_mm),
         flipper_length_mm = std_to_01(flipper_length_mm),
         body_mass_g       = std_to_01(body_mass_g)
         ) |>
  head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
6 Adelie  Torgersen          0.262         0.893             0.305       0.264
# ℹ 2 more variables: sex <fct>, year <int>
Ugh. Still copy-pasting!
dplyr
Recall across()!
penguins |>
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(var = .x)
                )
         ) |>
  head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
6 Adelie  Torgersen          0.262         0.893             0.305       0.264
# ℹ 2 more variables: sex <fct>, year <int>
Is it a good idea to scale (standardize) variables in a data analysis?
Why scale?
Why not scale?
E.g., a penguin with a bill length of 35 mm (std to 0.11) and a mass of 5500 g (std to 0.78).
Note
I used the existing function std_to_01(var) inside the new function for clarity!
Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.
Tidy evaluation isn’t naturally supported when writing your own functions.
When a piece of code is defused, R doesn’t return its value like normal.
We produce defused code when we use tidy evaluation and our own functions don’t know how to handle it.
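To make "defused" concrete, the sketch below contrasts ordinary evaluation with a defused expression. It uses expr() from the rlang package; the object name defused is only illustrative.

```r
library(rlang)

# Ordinary evaluation: R computes the value right away.
1 + 2

# Defusing: expr() hands back the expression itself, not its value.
defused <- expr(1 + 2)
defused
class(defused)  # "call"

# A defused expression can still be evaluated later.
eval(defused)   # 3
```

Tidyverse functions such as mutate() defuse their arguments in this way so that column names can be looked up inside the data frame rather than in the calling environment.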
Don’t use tidy evaluation in your own functions.
This is more complicated to read and use, but it’s safe.
Use embrace injection.
The rlang package provides the embrace operator ({{ }}) to simplify writing functions around tidyverse pipelines.
With the {{ }} operator, you can transport a variable from one function to another and can get around defused code!
Alternatively, use enquo(arg) to defuse and !!arg to inject.

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data <- data |>
    mutate(variable = std_to_01(variable))
  return(data)
}

std_column_01(data = penguins, variable = body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
mutate() doesn’t know what body_mass_g is.
Defuse with enquo(variable) and inject with !!variable.
Embrace the body_mass_g variable using {{ }}!
:=
When we use the embrace operator, we also have to use the walrus operator, :=, instead of =.
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181       0.292
2 Adelie  Torgersen           39.5          17.4               186       0.306
3 Adelie  Torgersen           40.3          18                 195       0.153
4 Adelie  Torgersen           NA            NA                  NA      NA
5 Adelie  Torgersen           36.7          19.3               193       0.208
6 Adelie  Torgersen           39.3          20.6               190       0.264
# ℹ 2 more variables: sex <fct>, year <int>
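The corrected function that produces output like the above is not preserved in this text. A sketch of the embrace version, assuming min-max rescaling for std_to_01() and using a small made-up data frame (toy) in place of penguins, might be:

```r
library(dplyr)

# Assumed helper: min-max rescaling to the interval [0, 1].
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  (var - min(var, na.rm = TRUE)) /
    (max(var, na.rm = TRUE) - min(var, na.rm = TRUE))
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data |>
    # {{ }} passes the unquoted column name through to mutate();
    # := is required because the left-hand side is embraced too.
    mutate({{ variable }} := std_to_01({{ variable }}))
}

# Hypothetical data, used only to show the call.
toy <- data.frame(id = c("a", "b", "c"), mass = c(10, 20, 30))
std_column_01(data = toy, variable = mass)
```

Only the named column is overwritten; every other column is returned unchanged, which matches the printed output where body_mass_g alone has been rescaled.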
What if I want to modify multiple columns?
across()!
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA
5 Adelie  Torgersen          0.167         0.738             0.356       0.208
# ℹ 2 more variables: sex <fct>, year <int>
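The code behind this multi-column output is also missing from the text. One way it could be written, combining across() with the embrace operator, is sketched below. The function name std_columns_01, the argument name variables, and the toy data are assumptions, as is the min-max body of std_to_01().

```r
library(dplyr)

# Assumed helper: min-max rescaling to the interval [0, 1].
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  (var - min(var, na.rm = TRUE)) /
    (max(var, na.rm = TRUE) - min(var, na.rm = TRUE))
}

std_columns_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  data |>
    # Embracing inside across() lets the caller pass any column selection,
    # such as x:y or c(x, y). No := is needed because across() names the
    # output columns itself.
    mutate(across(.cols = {{ variables }}, .fns = std_to_01))
}

# Hypothetical data, used only to show the call.
toy <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3), y = c(10, 30, 20))
std_columns_01(data = toy, variables = x:y)
```

With the penguins data, the equivalent call would pass bill_length_mm:body_mass_g as the selection, mirroring the earlier across() example.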
Article on How Building Functions with Variable Names has Changed Over the Years
Consider a study of depression.
We implicitly assume observations are missing completely at random!
We need to take more care when dealing with missing values!